Abstract. The statistical analysis of tree structured data is a new topic in statistics with wide 
application areas. Some Principal Component Analysis (PCA) ideas were previously developed for 
binary tree spaces. In this study, we extend these ideas to the more general space of rooted and 
labeled trees. We re-define concepts such as tree-line and forward principal component tree-line for 
this more general space, and generalize the optimal algorithm that finds them. 

We then develop an analog of the classical dimension reduction technique in PCA for the tree 
space. To do this, we define the components that carry the least amount of variation of a tree 
data set, called backward principal components. We present an optimal algorithm to find them. 
Furthermore, we investigate the relationship of these to the forward principal components, and prove 
a path-independence property between the forward and backward techniques. 

We apply our methods to a brain artery data set of 98 subjects. Using our techniques, 
we investigate how aging affects the brain artery structure of males and females. We also analyze 
a data set of organization structure of a large US company and explore the structural differences 
across different types of departments within the company. 

1. Introduction 

In statistics, data sets that reside in high dimensional spaces are quite common. A widely used 
set of techniques to simplify and analyze such sets is principal component analysis (PCA). It was 
introduced by Pearson in 1901 and independently by Hotelling in 1933. A comprehensive introduction 
can be found in Jolliffe (2002). The main aim of PCA is to provide a smaller subspace 
such that the maximum amount of information is retained when the original data points are projected 
onto it. This smaller subspace is expressed through components. In many contexts, one 
dimensional subspaces are called lines, so we will follow this terminology. The line that carries 
the most variation present in the data set is called the first principal component (PC1). The second 
principal component (PC2) is the line such that when combined with PC1, the most variation that 
can be retained in a two-dimensional subspace is kept. One may repeat this procedure to find as 
many principal components as necessary to properly summarize the data set in a manageable sized 
subspace formed by the principal components. 

Another way to characterize the principal components is to consider the distances of the data 
points to a given subspace. The line which minimizes the sum of squared distances of the data points 
to it can be considered as PC1. Similarly, PC2 is the line that, when combined with PC1, minimizes the 
sum of squared distances of the data points to this combination. 

An important topic within PCA is called dimension reduction (see Mardia et al. (1973) for 
dimension reduction and Jolliffe (2002) pp. 144, for the backward elimination method). The aim of 
the dimension reduction method is to find the components such that, when they are eliminated, the 
remaining subspace retains the maximum amount of variation. Or alternatively, the remaining 
subspace has the minimum sum of squared distances to the data points. These are the 
components with least influence. 

We would like to note that, in the general sense, any PCA method can be regarded as a dimension 
reduction process. However, Mardia et al. (1973) reserve the term dimension reduction specifically 
for this method, which some other resources also refer to as backward elimination, or backward PCA. 
In this paper we will follow the convention of Mardia et al. (1973), together with the "backward PCA" 
terminology. The original approach will be called forward PCA. 

In general, the choice of which technique to use depends on the needs of the end user: If only 
a few principal components with most variation in them are needed, then the forward approach is 
more suitable. If the aim is to eliminate only a few least useful components, then the backward 
approach would be the appropriate choice. 

The historically most common space used in statistics is the Euclidean space ($\mathbb{R}^n$), and the PCA 
ideas were first developed in this context. In $\mathbb{R}^n$, the two definitions of PCs (maximum variation 
and minimum distance) are equivalent, and the components are all orthogonal to each other. In 
Euclidean space, applying forward or backward PCA $n$ times for a data set in $\mathbb{R}^n$ would provide 
an orthogonal basis for the whole space. Moreover, in this context, the set of components obtained 
with the backward approach is the same as the one obtained by the classical forward approach, 
only the order of the components is reversed. This is a direct result of orthogonality properties 
in Euclidean space. This phenomenon can be referred to as path independence and it is very rare 
in non-Euclidean spaces. In fact, this paper may be presenting the first known example of path 
independence in non-Euclidean spaces. 

With the advancement of technology, more and more data sets that do not fit into the Euclidean 
framework have become available to researchers. A major source of these has been the biological sciences, 
which collect detailed images of their objects of interest using advanced imaging technologies. The 
need to statistically analyze such non-traditional data sets gave rise to many innovations in 
statistics. The type of non-traditional setting we will be focusing on in this paper is sets of trees as data. 
Such sets arise in many contexts, such as blood vessel trees (Aylward and Bullitt (2002)), lung 
airways trees (Tschirren et al. (2002)), and phylogenetic trees (Billera et al. (2001)). 

A first starting point in PCA for trees is Wang and Marron (2007), who attacked the 
problem of analyzing the brain artery structures obtained through a set of Magnetic Resonance 
Angiography (MRA) images. They modeled the brain artery system of each subject as a binary 
tree and developed an analog of the forward PCA in the binary tree space. They provided appropriate 
definitions of concepts such as distance, projection and line in binary tree space. They gave 
formulations of first, second, etc. principal components for binary tree data sets based on these 
definitions. This work was the first study to adapt classical PCA ideas from Euclidean 
space to the new binary tree space. 

The PCA formulations of Wang and Marron (2007) gave rise to interesting combinatorial optimization 
problems. Aydin et al. (2009) provided an algorithm to find the optimal principal 
components in binary tree space in linear time. This development enabled a numerical analysis on 
a full-size data set of brain arteries, revealing a correlation between their structure and age. 

In the context of PCA in non-Euclidean spaces, Jung et al. (2010) gave a backward PCA 
interpretation in image analysis. They focus on mildly non-Euclidean, or manifold data, and 
propose the use of Principal Nested Spheres as a backward step-wise approach. 
Marron et al. (2010) provided a concise overview of backward and forward PCA ideas and 
their applications in various non-classical contexts. They also mention the possibility of backwards 
PCA for trees: "... The notion of backwards PCA can also generate new approaches to tree line 
PCA. In particular, following the backwards PCA principal in full suggests first optimizing over a 
number of lines together, and then iteratively reducing the number of lines." This quote essentially 
summarizes one of our goals in this paper. 

In this work, our first goal is to extend the definitions and results of Wang and Marron (2007) and 
Aydin et al. (2009) on forward PCA from binary tree space to the more general rooted labeled tree 
space. We will provide the generalized versions of some basic definitions such as distance, projection, 
PC, etc., and proceed with showing that the optimal algorithms provided for the limited binary 
tree space can be extended to the general rooted labeled tree space. 

A rooted labeled tree is a tree such that there is a single node designated as a root, and each 
node is labeled in such a way that a correspondence structure can be established between data 
trees. For example, in the binary tree context, this means that the left and right child nodes of 
any node are distinct from each other. In general, the labeling of the nodes greatly affects the 
statistical results obtained from any data set. For the rest of the paper, we will refer to the rooted 
labeled tree space as tree space. 

Next, we attack the problem of finding an analog of dimension reduction. We first provide a definition 
for principal components with least influence (we call these backward principal components) 
in tree space, and define the optimization problem to be solved to reach them. We then provide a 
linear time algorithm to solve this problem to optimality. 

Furthermore, we prove that the set of backward principal components in tree space is the same 
as the forward set, with order reversed, just like their counterparts in the classical Euclidean 
space. This equivalence is significant since the same phenomenon in Euclidean space is a result of 
orthogonality, and the concept of orthogonality does not carry over to the tree space. This result 
enables the analyst to switch between the two approaches as necessary while the results remain 
comparable, i.e., the components and their influence do not depend on which approach is used to 
find them. Therefore, the path independence property holds in tree space PCA as well. 

Our numerical results come from two main data sets. The first is an updated version of the 
brain artery data set previously used by Aydin et al. (2009). Using our backward PCA tool, we 
investigate the effect of aging in brain artery structure in male and female subjects. We define 
two different kinds of age effect on the artery structure: overall branchyness and location-specific 
effects. We report that while both of these effects are strongly observed in males, they could not be 
observed in females. Secondly, we present a statistical analysis of the organization structure of a 
large US company. We present evidence on the structural differences across departments focusing 
on finance, marketing, sales and research. 

The organization of the paper is as follows: In Section 2 we provide the definitions of concepts 
such as distance, projection, etc. in general tree space, together with a description of the forward 
approach and the algorithm to solve it. These are generalizations of the concepts introduced in 
Wang and Marron (2007) and Aydin et al. (2009). In Section 3 we describe the problem of finding 
the backward principal components in tree space and give an algorithm to find the optimal solution. 
In Section 4 we prove the equivalence of forward and backward approaches in tree space. Section 
5 contains our numerical analysis results. 
2. Forward PCA in Tree Space 

In this section, we will provide definitions of some key concepts such as distance, projection, 
etc. in tree space, together with illustrative examples. The binary tree space versions of these 
definitions were previously given in Wang and Marron (2007) and Aydin et al. (2009). We will 
also provide the tree space versions of their PCA results, and prove their optimality in the more 
general tree space. 

In this paper the term tree is reserved for rooted tree graphs in which each node is distinguished 
from the others through labels. The labeling method can differ depending on the properties of 
any tree data set. For labeling binary trees, Wang and Marron (2007) use a level-order indexing 
method. In this scheme the root node has index 1. For the remaining nodes, if a node has index $i$, 
then the index of its left child is $2i$ and of its right child is $2i + 1$ (see Figure 1). Labeling general 
trees may get significantly more complicated. 




Figure 1. Two trees whose nodes are labeled using the level-order indexing method. 
The children of any node are distinct from each other. The nodes 1, 2 and 3 in the 
left data tree correspond to the nodes 1, 2 and 3 in the right data tree. 
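
Under level-order indexing, parent and child labels follow from integer arithmetic alone. The following minimal Python sketch illustrates this; the helper names `children`, `parent` and `root_path` are our own illustrative choices and do not come from the cited papers.

```python
def children(i):
    """Labels of the left and right child of the node with index i."""
    return 2 * i, 2 * i + 1


def parent(i):
    """Label of the parent of node i; the root (index 1) has no parent."""
    return i // 2 if i > 1 else None


def root_path(i):
    """Labels on the path from the root down to node i."""
    path = [i]
    while path[-1] > 1:
        path.append(path[-1] // 2)
    return path[::-1]


print(children(3))   # (6, 7)
print(root_path(7))  # [1, 3, 7]
```

Because labels encode position, two nodes in different data trees correspond to each other exactly when they carry the same index.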

A data set, $T$, is an indexed finite set of $n$ trees. A distance metric between two trees is the 
size of the symmetric difference of their node sets. Given two trees, $t_1$ and $t_2$, the distance between 
$t_1$ and $t_2$, denoted by $d(t_1, t_2)$, is 

$$d(t_1, t_2) = |t_1 \setminus t_2| + |t_2 \setminus t_1|,$$

where $|\cdot|$ is the number of nodes and $\setminus$ is the node set difference. In Figure 1, the nodes 1, 2 and 
3 are common to both of the trees, so they do not contribute to the distance between them. The 
nodes 4,5, 6 and 7 exist in one data tree but not in the other, therefore, the distance between the 
left and right trees in the figure is |{4, 5, 6, 7}| = 4. 
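
As a minimal sketch, the distance can be computed directly on node sets. The two node sets below are an assumption made for illustration (the text only states that nodes 1, 2, 3 are shared and nodes 4 to 7 are not), and `tree_distance` is a hypothetical helper name.

```python
def tree_distance(t1, t2):
    """Number of nodes that belong to exactly one of the two trees."""
    return len(t1 - t2) + len(t2 - t1)


# Assumed split of the unshared nodes between the two trees.
left_tree = {1, 2, 3, 4, 5}
right_tree = {1, 2, 3, 6, 7}

print(tree_distance(left_tree, right_tree))  # 4
```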

The support tree and the intersection tree of a data set $T = \{t_1, \dots, t_n\}$ are defined as 

$$\mathrm{Supp}(T) = \bigcup_{i=1}^{n} t_i \quad \text{and} \quad \mathrm{Int}(T) = \bigcap_{i=1}^{n} t_i,$$

respectively. 
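
With trees stored as sets of node labels, both constructions reduce to set operations. The sketch below uses a made-up data set of three binary trees with level-order labels; the function names are illustrative only.

```python
def support_tree(data):
    """Union of the node sets of all data trees."""
    return set().union(*data)


def intersection_tree(data):
    """Nodes shared by every data tree (the data set must be non-empty)."""
    return set.intersection(*map(set, data))


# Hypothetical toy data set.
data = [{1, 2, 4}, {1, 2, 3, 6}, {1, 2, 3, 4, 5}]

print(sorted(support_tree(data)))       # [1, 2, 3, 4, 5, 6]
print(sorted(intersection_tree(data)))  # [1, 2]
```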

As before, the line concept is a close counterpart to the lines in Euclidean space. In the most 
general sense line refers to a set of points that are next to each other. These points lie in a 
given direction, which makes the line "one-dimensional". Due to the discrete nature of tree space, 
the points (trees) that are next to each other are defined as the points at distance 1, the smallest 
possible distance between two non-identical trees. To mimic the one-dimensional direction property, 
we require that every next point on the line in tree space is obtained by adding a child of the most 
recently added node. The resulting construct is a set of trees that starts from a starting tree and 
expands following a path away from the root, which is akin to the sense of direction in Euclidean 
space. A formal definition of a line in tree space is given as follows: 

Definition 2.1. Given a data set $T$, a tree-line, $L = \{l_0, \dots, l_m\}$, is a sequence of trees where $l_0$ 
is called the starting tree, and $l_i$ is defined from $l_{i-1}$ by the addition of a single node $v_i \in \mathrm{Supp}(T)$. 
In addition, each $v_i$ is a child of $v_{i-1}$. 
See Example 2.3 for an example tree-line. 

The next concept to construct is the projection in this space. In general, the projection of a 
point onto an object can be defined as the closest point on the object to the projected point. This 
can be formalized in tree space as: 

Definition 2.2. The projection of a tree $t$ onto the tree-line $L$ is 

$$P_L(t) = \arg\min_{l \in L} \{d(t, l)\}.$$

The projection of a data tree onto a tree-line can be regarded as the point in the tree-line most 
similar to the data tree. 



Example 2.3 contains a small data set and a tree-line, and illustrates how the projection of each 
data point onto the given tree-line can be found. 

Example 2.3. Let us consider the following data set consisting of 3 data points. For simplicity, 
we use a set consisting of binary trees only. 



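[Figure: the data trees $t_1$, $t_2$, $t_3$ of $T$ and the support tree $\mathrm{Supp}(T)$] 

and a tree-line 

[Figure: the tree-line $L = \{l_0, l_1, l_2\}$] 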





The following table gives the distance between each tree of $T$ and each tree of $L$: 

           l_0   l_1   l_2 
    t_1     3     2     1 
    t_2     5     4     5 
    t_3     8     7     6 

So, we can observe that $P_L(t_1) = l_2$, $P_L(t_2) = l_1$ and $P_L(t_3) = l_2$. 

Finally, we will define the concept of "path" that will be useful later on. 

Definition 2.4. Given a tree-line $L = \{l_0, \dots, l_m\}$, the path of $L$ is the unique path from the root 
to $v_m$, the last node added in $L$, and it is denoted by $p_L$. 

Note that our path definition is different from the one given in Aydin et al. (2009), which 
included only the nodes added to the starting tree instead of forming a set starting from the root 
node. 

The next lemma provides an easy-to-use formula for the projection of a data point. The proof 
of it can be found in the Appendix. 

Lemma 2.5. Let $t$ be a binary tree and $L = \{l_0, \dots, l_m\}$ be a tree-line. Then 

$$P_L(t) = l_0 \cup (t \cap p_L).$$

It follows that the projection of a tree onto a tree-line is unique. 
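
The closed form of Lemma 2.5 can be checked against a brute-force search on a small case. The sketch below assumes binary trees stored as sets of level-order labels; the tree-line and the data tree are made up for illustration and are not the ones of Example 2.3, and all function names are hypothetical.

```python
def tree_line(start, added_nodes):
    """Build the sequence l_0, l_1, ... by adding one node at a time."""
    line = [frozenset(start)]
    for v in added_nodes:
        line.append(line[-1] | {v})
    return line


def root_path(v):
    """Level-order labels on the path from the root (label 1) to node v."""
    path = [v]
    while path[-1] > 1:
        path.append(path[-1] // 2)
    return path[::-1]


def project_brute_force(t, line):
    """Member of the tree-line closest to t in symmetric-difference distance."""
    return min(line, key=lambda l: len(t ^ l))


def project_by_lemma(t, start, last_node):
    """Closed form l_0 union (t intersect p_L) from Lemma 2.5."""
    return frozenset(start) | (t & set(root_path(last_node)))


start = {1}
line = tree_line(start, [2, 4])   # l_0={1}, l_1={1,2}, l_2={1,2,4}
t = frozenset({1, 2, 3, 5})

print(sorted(project_brute_force(t, line)))  # [1, 2]
print(sorted(project_by_lemma(t, start, 4)))  # [1, 2]
```

The brute-force search costs one distance evaluation per member of the tree-line, whereas the closed form needs a single set intersection.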

Wang and Marron (2007) gave a definition of first principal component tree-line in the binary 
tree space. It was defined as the tree-line that minimizes the sum of distances of the data points 
to their projections on the line. This can be viewed as the one-dimensional line that best fits the 
data. We will provide their definition below, adapted to the general tree space. We also note that 
this is the "forward PCA" approach where a subspace that carries the most amount of variation is 
sought. We will develop the "backward PCA" approach in the upcoming section. 

Definition 2.6. For a data set $T$ and the set of all tree-lines $\mathcal{L}$ in $\mathrm{Supp}(T)$ with the same starting 
point $l_0$, the first (forward) principal component tree-line, PC1, is 

$$L_1^f = \arg\min_{L \in \mathcal{L}} \sum_{t \in T} d(t, P_L(t)).$$



As we will see in Example 2.11, the definition of the principal components allows multiple 
solutions. A tie-breaking rule depending on the nature of the data should be established to reach 
consistent results in the presence of ties. In order to have a tie-breaking rule dealing with the 
PC definition, we assume that the set of all tree-lines is totally ordered. This tie-breaking rule 
(total order) induces an order on the set of paths. Thus, we denote by $p_L > p_{L'}$ that the path $p_L$ is 
preferred to $p_{L'}$. 

For an analogous notion of the additional components in tree space, we need to define the 
concept of the union of tree-lines, and projection onto a union. Given tree-lines 
$L_1 = \{l_{1,0}, l_{1,1}, \dots, l_{1,m_1}\}, \dots, L_q = \{l_{q,0}, l_{q,1}, \dots, l_{q,m_q}\}$, their union is the set of all possible 
unions of members of $L_1$ through $L_q$: 

$$L_1 \cup \cdots \cup L_q = \{\, l_{1,i_1} \cup \cdots \cup l_{q,i_q} \mid i_1 \in \{0, \dots, m_1\}, \dots, i_q \in \{0, \dots, m_q\} \,\}.$$

In light of this, the projection of a tree $t$ onto $L_1 \cup \cdots \cup L_q$ is 

$$P_{L_1 \cup \cdots \cup L_q}(t) = \arg\min_{l \in L_1 \cup \cdots \cup L_q} \{d(t, l)\}.$$

Next, we provide the definition of the general $k$-th PC: 

Definition 2.7. For a data set $T$ and the set of all tree-lines $\mathcal{L}$ in $\mathrm{Supp}(T)$ with the same starting 
point $l_0$, the $k$-th (forward) principal component tree-line, PC$k$, is defined recursively as 

$$L_k^f = \arg\min_{L \in \mathcal{L}} \sum_{t \in T} d\bigl(t, P_{L_1^f \cup \cdots \cup L_{k-1}^f \cup L}(t)\bigr).$$

The path of the $k$-th principal component tree-line will be denoted by $p_k^f$. 

The following lemma describes a key property that will be used to interpret the projection of a 
tree onto a subspace defined by a set of tree-lines. The reader may refer to the Appendix for the 
proof. 

Lemma 2.8. Let $L_1, L_2, \dots, L_q$ be tree-lines with a common starting point, and $t$ be a tree. Then 

$$P_{L_1 \cup \cdots \cup L_q}(t) = P_{L_1}(t) \cup \cdots \cup P_{L_q}(t).$$
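
Lemma 2.8 can also be illustrated on a small case by enumerating the union of two tree-lines explicitly. The sketch below assumes trees stored as sets of level-order labels; the two tree-lines and the data tree are invented for the illustration, and the helper names are hypothetical.

```python
from itertools import product


def tree_line(start, added_nodes):
    """Build the sequence l_0, l_1, ... by adding one node at a time."""
    line = [frozenset(start)]
    for v in added_nodes:
        line.append(line[-1] | {v})
    return line


def project(t, candidates):
    """Candidate tree closest to t in symmetric-difference distance."""
    return min(candidates, key=lambda l: len(t ^ l))


line_a = tree_line({1}, [2, 4])   # grows along the path 1-2-4
line_b = tree_line({1}, [3, 6])   # grows along the path 1-3-6

# Union of tree-lines: every union of one member from each line.
union = {a | b for a, b in product(line_a, line_b)}

t = frozenset({1, 2, 3, 5})
lhs = project(t, union)
rhs = project(t, line_a) | project(t, line_b)

print(sorted(lhs))  # [1, 2, 3]
print(sorted(rhs))  # [1, 2, 3]
```

The explicit union has nine members here and grows multiplicatively with the number of tree-lines, which is why projecting onto each line separately is the practical route.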

Aydin et al. (2009) provided a linear time algorithm to find the forward principal components 
in binary tree space. We will give a generalization of that algorithm in tree space, and prove that 
the extended version also gives the optimal PCs. The algorithm uses the weight function $w_k^f(v)$, 
defined as follows: 
Definition 2.9. Let $T$ be a data set and $\mathcal{L}$ be the set of all tree-lines with the same starting point $l_0$. 
Let $\delta$ be an indicator function, defined as $\delta(v, t) = 1$ if $v \in t$, and $0$ otherwise. Given $L_1^f, \dots, L_{k-1}^f$, 
the first $k - 1$ PC tree-lines, the $k$-th weight of a node $v \in \mathrm{Supp}(T)$ is 

$$w_k^f(v) = \begin{cases} 0, & \text{if } v \in l_0 \cup p_1^f \cup \cdots \cup p_{k-1}^f, \\ \sum_{t \in T} \delta(v, t), & \text{otherwise.} \end{cases}$$



The following algorithm computes the $k$-th PC tree-line: 

Algorithm 2.10. Forward algorithm. Let $T$ be a data set and $\mathcal{L}$ be the set of all tree-lines with 
the same starting point $l_0$. 

Input: $L_1^f, \dots, L_{k-1}^f$, the first $k - 1$ PC tree-lines. 

Output: A tree-line. 

Return the tree-line whose path maximizes the sum of $w_k^f$ weights in the support tree. Break ties 
according to an appropriate tie-breaking rule. 
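
A minimal Python sketch of this procedure is given below. It is an illustration under several assumptions rather than the linear-time implementation: trees are binary and stored as sets of level-order labels, ties are broken in favor of the leftmost path, the toy data set is made up, and every function name is hypothetical. Each candidate path is identified by the support-tree leaf at which it ends.

```python
from collections import Counter


def root_path(v):
    """Level-order labels on the path from the root (label 1) to node v."""
    path = [v]
    while path[-1] > 1:
        path.append(path[-1] // 2)
    return path[::-1]


def forward_pc_leaves(data, start):
    """End leaves of the PC1, PC2, ... paths, in that order."""
    support = set().union(*data)
    counts = Counter(v for t in data for v in t)  # sum over t of delta(v, t)
    # Maximal paths end at leaves of the support tree outside the starting tree.
    leaves = [v for v in support
              if 2 * v not in support and 2 * v + 1 not in support
              and v not in start]
    leaves.sort(key=root_path)  # left-to-right order, used for tie-breaking
    covered = set(start)        # nodes whose weight is zero
    order = []
    while leaves:
        def path_weight(leaf):
            return sum(counts[v] for v in root_path(leaf) if v not in covered)
        best = max(leaves, key=path_weight)  # first maximum = leftmost on ties
        order.append(best)
        covered.update(root_path(best))      # zero out weights on the chosen path
        leaves.remove(best)
    return order


# Hypothetical toy data set, with the root as starting tree.
data = [{1, 2, 4}, {1, 2, 3, 6}, {1, 2, 3, 4, 5}]
print(forward_pc_leaves(data, start={1}))  # [4, 6, 5]
```

In this toy run the first path (root to node 4) has weight 3 + 2 = 5; once its nodes are zeroed, the path to node 6 (weight 2 + 1 = 3) beats the path to node 5 (weight 1).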

To better explain how the algorithm works, we will apply the forward algorithm to the toy data 
set given in Example 2.3. 



Example 2.11. In this example, we break ties by selecting the tree-line with the leftmost path. 
We take the intersection tree as the starting point (illustrated in red below). The table given below 
summarizes iterations of the algorithm, where each row corresponds to one iteration. At each of the 
iterations, the name of the principal component obtained at that iteration is given in the left column. 
The support tree with updated weights ($w_k^f(\cdot)$) is given in the middle column. The paths of the selected 
PC tree-lines according to these weights are given in the right column. 



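[Table: iterations of the forward algorithm on the data set of Example 2.3. Rows: PC1, PC2, PC3, ..., PC6. 
Columns: principal component; support tree with updated node weights; path of the selected PC tree-line.] 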
The next theorem states that the tree-line returned by the forward algorithm is precisely the 
$k$-th PC tree-line. The proof is in the Appendix. 

Theorem 2.12. Let $T$ be a data set and $\mathcal{L}$ be the set of all tree-lines with the same starting point 
$l_0$. Let $L_1^f, \dots, L_{k-1}^f$ be the first $k - 1$ PC tree-lines. Then, the forward algorithm returns the 
$k$-th PC tree-line, $L_k^f$. 

In theory, an arbitrary line would extend to infinity. In this paper, we limit our scope to the 
line pieces that reside within the support tree of a given data set since extending lines outside of 
the support tree's scope would introduce unnecessary trivialities. Within this restriction, it can be seen 
that the possible principal component tree-lines for a given data set are those whose paths are 
maximal (there is no other path in $\mathrm{Supp}(T)$ containing $p_L$). We also consider only the tree-lines 
that are not trivial (the tree-line consists of $l_0$ and at least one more point). 

In light of this, we let $\mathcal{L}_P$ denote the set of all maximal non-trivial tree-lines with starting 
point $l_0$ contained in $\mathrm{Supp}(T)$. Also, we let $\mathcal{P}$ be the set of all paths in $\mathrm{Supp}(T)$ from the 
root to leaves that are not in $l_0$. It is easy to see that $\mathcal{P}$ is the set of paths of tree-lines in $\mathcal{L}_P$. Also 
note that $|\mathcal{L}_P| = |\mathcal{P}| = n$ and $\mathrm{Supp}(T) = l_0 \cup \bigcup_{p_L \in \mathcal{P}} p_L$. 

3. Dimension Reduction for Rooted Trees 

In this section, we will define backward principal component tree-lines. These structures are the tree 
space equivalents of the backward principal components in the classical dimension reduction setting. 
They represent the directions that carry the least information about the data set and thus can be 
taken out. Our definition describes backward principal components as directions such that when 
eliminated, the remaining subspace will retain the maximum amount of variation. Or alternatively, 
the remaining subspace will have the minimum sum of squared distances to the data points. These 
are considered to be the components with least influence. We also present an algorithm that finds 
these components, and we provide a theoretical result proving the optimality of our algorithm. 

While using the backward approach, we must use the opposite of the tie-breaking rule used in the 
forward approach. That is, $p_L > p_{L'}$ means that the path $p_{L'}$ is preferred to $p_L$. 

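Definition 3.1. For a data set $T$ and the set of tree-lines $\mathcal{L}_P$ with the same starting point $l_0$, the 
$n$-th backward principal component tree-line, BPC$n$, is 

$$L_n^b = \arg\min_{L \in \mathcal{L}_P} \sum_{t \in T} d\bigl(t, P_{\bigcup_{L' \in \mathcal{L}_P \setminus \{L\}} L'}(t)\bigr).$$

The $(n - k)$-th backward principal component tree-line is defined recursively as 

$$L_{n-k}^b = \arg\min_{L \in \mathcal{L}_P \setminus \{L_n^b, \dots, L_{n-k+1}^b\}} \sum_{t \in T} d\bigl(t, P_{\bigcup_{L' \in \mathcal{L}_P \setminus \{L_n^b, \dots, L_{n-k+1}^b, L\}} L'}(t)\bigr). \tag{3.1}$$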

The path associated with the $(n - k)$-th backward principal component tree-line will be denoted 
by $p_{n-k}^b$. The following node weight definition will be key to the upcoming algorithm for finding 
backward components: 

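Definition 3.2. Let $T$ be a data set and $\mathcal{L}$ be the set of all tree-lines with the same starting point $l_0$. 
Let $L_n^b, \dots, L_{n-k+1}^b$ be the last $k$ BPC tree-lines and $B = \mathcal{P} \setminus \{p_n^b, \dots, p_{n-k+1}^b\}$. For $v \in \mathrm{Supp}(B)$, 
the $(n - k)$-th backward weight of the node $v$ is 

$$w_{n-k}^b(v) = \begin{cases} 0, & \text{if } v \in l_0 \text{ or } v \text{ belongs to at least two different paths of } B, \\ \sum_{t \in T} \delta(v, t), & \text{otherwise.} \end{cases}$$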
The following algorithm computes the backward principal components. 

Algorithm 3.3. Backward Algorithm. Let $T$ be a data set and $\mathcal{L}$ be the set of all 
tree-lines on $\mathrm{Supp}(T)$ with the same starting point $l_0$. 

Input: $L_n^b, \dots, L_{n-k+1}^b$, the last $k$ BPC tree-lines. 

Output: $L_{n-k}^b$, the $(n - k)$-th BPC tree-line. 

Let $B = \mathcal{P} \setminus \{p_n^b, \dots, p_{n-k+1}^b\}$. 

Return the tree-line $L_{n-k}^b$ whose path minimizes the sum of $w_{n-k}^b$ weights in the support tree $\mathrm{Supp}(B)$. 
If there is more than one candidate, select the tree-line according to an appropriate tie-breaking 
rule (the opposite of the tie-breaking rule used in the forward algorithm). 

Like the forward algorithm explained in the previous section, the backward algorithm finds the 
optimal solution in linear time. 
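
The backward procedure can be sketched in a few lines of Python. This is an illustration under assumptions, not the linear-time implementation: trees are binary and stored as sets of level-order labels, ties are broken in favor of the rightmost path, the toy data set is invented, and all function names are hypothetical. Paths are identified by the support-tree leaf at which they end.

```python
from collections import Counter


def root_path(v):
    """Level-order labels on the path from the root (label 1) to node v."""
    path = [v]
    while path[-1] > 1:
        path.append(path[-1] // 2)
    return path[::-1]


def backward_pc_leaves(data, start):
    """End leaves of the removed paths, in removal order BPCn, ..., BPC1."""
    support = set().union(*data)
    counts = Counter(v for t in data for v in t)  # sum over t of delta(v, t)
    remaining = [v for v in support
                 if 2 * v not in support and 2 * v + 1 not in support
                 and v not in start]
    remaining.sort(key=root_path, reverse=True)   # rightmost first, for ties
    removed = []
    while remaining:
        # Number of remaining paths that pass through each node.
        usage = Counter(v for leaf in remaining for v in root_path(leaf))

        def path_weight(leaf):
            # Nodes in the starting tree or shared by two or more paths weigh zero.
            return sum(counts[v] for v in root_path(leaf)
                       if v not in start and usage[v] == 1)

        worst = min(remaining, key=path_weight)   # first minimum = rightmost on ties
        removed.append(worst)
        remaining.remove(worst)
    return removed


# Hypothetical toy data set, with the root as starting tree.
data = [{1, 2, 4}, {1, 2, 3, 6}, {1, 2, 3, 4, 5}]
print(backward_pc_leaves(data, start={1}))  # [5, 6, 4]
```

In this toy run node 2 is shared by the paths to nodes 4 and 5, so it contributes nothing in the first iteration and the path to node 5 (weight 1) is removed first. Read in reverse, the removal order gives the paths to nodes 4, 6 and 5 as BPC1, BPC2 and BPC3.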

Next, we provide an example illustrating the steps of the backward algorithm. We will apply the 
backward algorithm to the toy data set given in Example 2.3. In this example, we use the same 
starting point as in Example 2.11. Furthermore, we use the opposite of the tie-breaking rule used in 
the forward algorithm, which in this case is to select the rightmost tree-line. 

Example 3.4. The table given below summarizes iterations of the algorithm, where each row corresponds 
to one iteration. At each of the iterations, the name of the backward principal component 
obtained at that iteration is given in the left column. The pruned support tree with updated weights 
($w_{n-k}^b(\cdot)$) is given in the middle column. The paths of the selected PC tree-lines according to these weights 
are given in the right column. 
The key theoretical result of the section, the optimality of the backward algorithm, is summarized 
as follows: 

Theorem 3.5. Let $T$ be a data set and $\mathcal{L}_P$ be the set of all tree-lines with the same starting point 
$l_0$ for this data set. Let $L_n^b, \dots, L_{n-k+1}^b$ be the last $k$ BPC tree-lines. Then, the backward algorithm 
returns the optimum $(n - k)$-th BPC tree-line, $L_{n-k}^b$. 

The proof of this theorem is in the Appendix. 

4. Equivalence of PCA and BPCA in Tree Space 

A very important aspect of tree space is that the notion of orthogonality does not exist. In 
the Euclidean space equivalent of backward PCA, the orthogonality property ensures that the 
components do not depend on the method used to find them, i.e., the most informative principal 
component is the same when forward or backward approaches are used. This powerful property of 
path-independence brings various advantages to the analyst. 

In this section, we will prove that the forward and backward approaches are equivalent in the 
tree space as well when tree-lines are used. This is a surprising result given the lack of any notion 
of orthogonality. In practice, this result will ensure that the components of backward and forward 
approaches in binary tree space are comparable. 

We will show this equivalence by proving that, for each $1 \le k \le n$, the $k$-th PC tree-line and 
the $k$-th BPC tree-line are equal. An equivalent statement is that their paths are equal: $p_k^f = p_k^b$. 
Without loss of generality, we will assume that a consistent tie-breaking method is established for 
both methods in choosing principal components whenever candidate tree-lines have the same sum 
of weights. All the proofs can be found in the Appendix. 

Proposition 4.1. Given an integer $1 \le k \le n$, let $p_1^f, \dots, p_k^f$ be the paths of the first $k$ principal 
components yielded by the forward algorithm and $p_n^b, \dots, p_{k+1}^b$ be the paths of the last $n - k$ principal 
components yielded by the backward algorithm. Then there exist no $i$ and $j$ such that $1 \le i \le k < 
j \le n$ and $p_i^f = p_j^b$. 

This proposition motivates the following theorem: 

Theorem 4.2. For each $1 \le k \le n$, the $k$-th PC tree-line obtained by the forward algorithm is equal 
to the $k$-th BPC tree-line obtained by the backward algorithm. 

This result guarantees the comparability of principal components obtained by either method, 
enabling the analyst to use them interchangeably depending on which type of analysis is appropriate 
at the time. 

5. Numerical Analysis 

In this section we will analyze two different data sets with tree structure. The first data set 
consists of branching structures of brain arteries belonging to 98 healthy subjects. An earlier 
version of this data set was used in Aydin et al. (2009) to illustrate the forward tree-line PCA 
ideas. That study showed that a significant correlation exists between the branching 
structure of brain arteries and the age of subjects. Later on, 30 more subjects were added to that 
data set, and the set went through a data cleaning process described in Aydin et al. (2011). In our 
study we will use this updated data set. 
Figure 2. Left panel: Reconstructed set of trees of brain arteries. The colors indicate 
regions of the brain: Back (gold), Right (blue), Front (red), Left (cyan). Right 
panel: An example binary tree obtained from one of the regions. Only branching 
information is retained. 

The second data set describes the organizational structure of a large company. The details of this 
data set are proprietary information; therefore, revealing details will be withheld. We will investigate 
the organizational structural differences between business units, and differences between types of 
departments. 

As stated before, we focus on data trees where nodes are distinctly labeled. When constructing 
a tree data set, labeling of the nodes is crucial since these labels help determine which nodes in a 
data tree correspond to the nodes in another, and thus shape the outcome of the whole analysis. 
The word correspondence is used to refer to this choice. We will handle the correspondence issue 
separately for each data set we introduce. 

5.1. Brain Artery Data Set. 

5.1.1. Data Description. The properties of the data set were previously explained in Aydin et al. 
(2009). For the sake of completeness, we will provide a brief summary. 

The data is extracted from Magnetic Resonance Angiography (MRA) images of 98 healthy subjects 
of both sexes, ranging in age from 18 to 72. This data can be found at Handle (2008). Aylward 
and Bullitt (2002) applied a tube tracking algorithm to construct 3D images of brain arteries from 
MRA images. See also Bullitt et al. (2010) for further results on this set. 

The artery system of the brain consists of 4 main systems, each feeding a different region of the 
brain. In Figure 2 they are indicated by different colors: gold for the back, cyan for the left, blue 
for the right and red for the front regions. The system feeding each of the regions is represented 
as a binary tree, reduced from the 3D visuals seen in Figure 2. The reason for this is to focus on 
the branching structure only. Each node in a binary tree represents a vessel tube between two split 
points in the 3D representation. The two tubes formed by this split become the children nodes of 
the previous tube. The initial main artery that enters the brain, and feeds the region through its 
splits, constitutes the root node in the binary tree. The binary tree provided in Figure 2 (right 
panel) is an example binary tree extracted from a 3D image through this process. 

The correspondence issue for this data set is solved as follows. At each split, the child with the larger 
number of descendant nodes is designated the left child, and the other node becomes 
the right child. This scheme is called descendant correspondence. 
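
As a minimal sketch of descendant correspondence, the Python code below relabels an unlabeled binary tree with level-order indices so that the child with the larger subtree always receives the left index. The nested-list tree encoding and the function names are assumptions made for illustration. The text does not say how equal-sized subtrees are ordered; the sketch simply keeps their input order.

```python
def subtree_size(node):
    """Number of nodes in the subtree rooted at node (a list of child subtrees)."""
    return 1 + sum(subtree_size(child) for child in node)


def descendant_labels(node, index=1):
    """Level-order labels of a binary tree under descendant correspondence."""
    if len(node) > 2:
        raise ValueError("this sketch handles binary trees only")
    labels = {index}
    # Larger subtree first, so it receives the left-child index 2 * index.
    # sorted() is stable, so equal-sized children keep their input order.
    ordered = sorted(node, key=subtree_size, reverse=True)
    for offset, child in enumerate(ordered):
        labels |= descendant_labels(child, 2 * index + offset)
    return labels


# Root with two children: a leaf, and an internal node carrying two leaves.
tree = [[], [[], []]]
print(sorted(descendant_labels(tree)))  # [1, 2, 3, 4, 5]
```

Keeping the input order instead would label the same tree {1, 2, 3, 6, 7}, which shows how strongly the choice of correspondence shapes the resulting node sets.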

The study of brain artery structure is important in understanding how various factors affect 
this structure, and how they are related to certain diseases. The correlation between aging and 
branching structure was shown in previous studies (Aydin et al. (2009), Bullitt et al. (2010)). 



12 CARLOS A. ALFARO, BURCU AYDIN, ELIZABETH BULLITT, ALIM LADHA, AND CARLOS E. VALENCIA 

The brain vessel structure is known to be affected by hypertension, atherosclerosis, retinal disease 
of prematurity, and a variety of hereditary diseases. Furthermore, results of studying this 
structure may lead to establishing ways to help predict risk of vessel thrombosis and stroke. Another 
very important implication regards malignant brain tumors. These tumors are known to change and 
distort the artery structure around them, even at stages where they are too small to be detected 
by popular imaging techniques. Statistical methods that might differentiate these changes from 
normal structure may help earlier diagnoses. See Bullitt et al. (2003) and the references therein 
for detailed medical studies focusing on these subjects. 

5.1.2. Analysis of Artery Data. The forward tree-line PCA ideas were previously applied to an
earlier version of this data set. The first theoretical contribution of this paper, the extension of
tree-line PCA to general trees, does not affect this particular data set since all trees in it are binary.
Therefore we first focus on the dimension reduction approach we introduce. In Aydın et al. (2009),
only the first 10 principal components were computed, and the age effect was presented through the first 4
components. In general, the main philosophy of our dimension reduction or backward technique
is to determine how many dimensions need to be removed for enough noise to be cleared from
the data set so that the statistical correlations become visible or significant. We ask this question
for the effect of aging on the updated brain artery data set.
Also, Aydın et al. (2009) used the intersection trees as the starting point in calculating the
principal components. In this numerical study, we use the root node as the starting point of
the tree-lines.

An observation on this data set, or any data set consisting of large trees, is the abundance of
leaves. Many of the leaves exist in only one or a few data trees. This leads to
support trees that are much larger than any of the original data trees. The underlying structures
are expected to be seen in the upper levels, and most of the leaves can in fact be considered noise.
In our setting, the leaves that exist in only one or a few data trees make up the first backward
components. A question to ask is: what percentage of the variation is created by the low-weight leaves,
and what percentage is due to the high-weight nodes, or underlying shape? Figure 3 provides two
plots that illustrate an answer.

In Figure 3 (left panel), the number of backward components removed from the tree space is plotted
against the total variation explained by the remaining subspace. The Y values at
X = 0 correspond to the total variation before any components are removed. This value
is different for each subpopulation, as the sizes of their support trees are different. As backward
components are removed from each of the subspaces, the variation covered decreases. We can
observe that the initial backward components carry very little variation, and therefore their removal results in
a very small drop in the total number of nodes explained by the remaining subspace. This is
caused by the very large number of leaves that are not part of any underlying structure. The Y = 0
points of the curves mark the total number of principal components that cover the whole
data. This number is in fact equal to the number of leaves on the support tree of each
subpopulation.
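
As an illustration only, the construction of such a curve can be sketched as follows. The input list `drops` is hypothetical: entry k is assumed to hold the number of data-tree nodes that stop being explained when the (k+1)-th backward component is removed. The function names are ours and are not part of the original analysis.

```python
from itertools import accumulate

def variation_curve(drops):
    """Nodes still explained after removing the first x backward components.

    drops[k] is the (hypothetical) number of nodes lost when the (k+1)-th
    backward component is removed. Entry x of the result is the number of
    nodes explained by the remaining subspace once x components are gone.
    """
    total = sum(drops)
    removed = [0] + list(accumulate(drops))
    return [total - r for r in removed]

def scale_to_100(curve):
    """Rescale both axes so that their maxima are 100 (right-panel style)."""
    n = len(curve) - 1
    top = curve[0]
    xs = [100.0 * x / n for x in range(len(curve))]
    ys = [100.0 * y / top for y in curve]
    return xs, ys

# Toy input: many low-weight components first, a few heavy ones last.
drops = [1, 1, 1, 2, 2, 3, 5, 10, 25, 50]
curve = variation_curve(drops)
xs, ys = scale_to_100(curve)
```

In this toy input the curve starts at the total variation for X = 0 and reaches 0 once every component has been removed, mirroring the shape described above.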

On the right panel, we see the same information, only the X and Y axes for each of the curves
are scaled so that the maximum corresponds to 100. The first observation from this graph is
that the curves lie almost on top of each other: even though the sizes of their support trees
differ considerably, the same percentage of variation is explained by the same percentage of principal
components in each of these data sets. We can conclude from this that the variation is structured
similarly for each of these subpopulations. The second observation is that the majority of the
principal components explain very little variation. In the right panel of Figure 3 we see that for

Figure 3. Left panel: X axis represents the total number of backward principal
components removed from data. Y axis represents the number of nodes (variation)
explained by the remaining subspace after removal. Four subpopulations are shown:
Back (blue), Left (red), Right (magenta), Front (green). Right panel: Same information
as the left panel is used. For each subpopulation, the total variation and the
number of total backward principal components are scaled so that the maximum is
100.

all the subpopulations, the first 70% of the principal components only cover 10% of the nodes, and
the last 10% of these components explain about 70%. This data set is known to be very
high-dimensional (about 270 dimensions for the back subpopulation). However, Figure 3 shows that only a small
fraction of these dimensions is actually necessary to preserve the underlying structures.

Our next focus is to see, during the backward elimination process, at which points the
age-structure correlation is visible.

It was established previously that the branching of brain arteries is reduced with age. Bullitt
et al. (2002) noted an observed trend on this phenomenon, while Aydın et al. (2009) showed this
effect on the left subpopulation using principal components. In this paper, for each subpopulation, we
start from the whole subspace and reduce it gradually by removing backward principal components.
At each step the data trees are projected onto the remaining subspace. The relationship between
the age of each data point and the size of the data tree projection is explored by fitting a linear
regression line to these two series. These plots are not shown here, but similar ones can be found
in Aydın et al. (2009). This line tends to show a downward slope, suggesting that the projection
sizes decrease with age. To measure the statistical significance of this observation, p-values
are computed for the null hypothesis of zero slope. Figure 4 shows the plots of p-values at each step
of removing BPCs, for each subpopulation. The p-values are plotted on a natural logarithmic scale, while
the Y axis ticks are left at their original values. The rule of thumb for the p-value is that 0.05 or
less is considered significant. For stricter tests, 0.01 can also be used. Figure 4 provides grey lines at
both of these levels for reference.
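
For reference, the significance test described here is the standard slope test of simple linear regression, which can be written out as below. The symbols are notation introduced only for this illustration and do not appear in the original analysis: a_i is the age of subject i, s_i is the size of that subject's projected tree at a given step, and m is the number of subjects.

```latex
\[
s_i = \beta_0 + \beta_1 a_i + \varepsilon_i, \qquad H_0 : \beta_1 = 0,
\]
\[
\hat\beta_1 = \frac{\sum_{i=1}^{m} (a_i - \bar a)(s_i - \bar s)}{\sum_{i=1}^{m} (a_i - \bar a)^2},
\qquad
\hat\beta_0 = \bar s - \hat\beta_1 \bar a,
\]
\[
\mathrm{SE}(\hat\beta_1) = \sqrt{\frac{\frac{1}{m-2} \sum_{i=1}^{m} (s_i - \hat\beta_0 - \hat\beta_1 a_i)^2}{\sum_{i=1}^{m} (a_i - \bar a)^2}},
\qquad
T = \frac{\hat\beta_1}{\mathrm{SE}(\hat\beta_1)},
\]
\[
p = 2 \, \Pr\big( T_{m-2} \ge |T| \big),
\]
```

where T_{m-2} denotes a Student t variable with m-2 degrees of freedom. One such p-value is obtained at every step of the backward elimination.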

In Figure 4 we see that the front subpopulation does not reach the p-value levels that are
considered significant at any subspace. The front region of the brain, unlike the other regions, is
not fed by a direct artery entering the brain from below; instead it is fed by vessels extending from

Figure 4. X axis represents the scaled number of backward principal components
removed from the subspace of each of the subpopulations. At each X value, the data
points are projected onto the remaining subspace. The sizes of these projections,
plotted against age, show a downward trend (not shown here). Statistical significance
of this downward trend is tested by calculating the standard linear regression
p-value (Y axis) for the null hypothesis of zero slope. Y axis is scaled using natural
logarithm, while the Y axis ticks are given in original values. The grey horizontal
lines indicate 0.05 and 0.01 p-value levels. The subpopulations are colored as: Back
(blue), Left (red), Right (magenta), Front (green). A statistically significant age
effect is observed for subpopulations Back, Left and Right.



other regions (see Figure 2). Therefore it is not surprising that the front vessel subpopulation
does not exhibit a structural property displayed by the other three subpopulations.

For the other subpopulations, we identify two different kinds of age-structure dependence. First,
for the left and back subpopulations, the age versus projection size relationship is very strong until
only the last 5% of the components are left. Most of the early BPCs correspond to the small artery
splits that are abundant in the younger population, and which people tend to lose as age increases (Bullitt
et al. (2002)). Therefore the overall branchyness of the artery trees is reduced. Figure 4 is
consistent with this previous observation. The p-value significance becomes volatile over the last 5% of
the components, where the BPCs corresponding to the small artery splits have been removed, and only
the largest components remain in the subspace. These largest components correspond to the main
arteries that branch the most. The location-specific relationship between structure and age, noted
in Aydın et al. (2009), can be observed for the left and back subpopulations towards the end of the X
axis. This is the second kind of dependence we observe in the data sets. For the right subpopulation,
we only observe the first kind, and it does not seem to be as strong as for the left and back subpopulations.

Our second focus is to repeat the question of age-structure relationship for the male and female
subpopulations. Our data set consists of 49 male, 47 female and 2 trans-gender subjects. We run
our analysis for the two largest groups to see how aging affects males and females separately.

Figure 5. The left and right panels are the p-value versus subspace plots for the female
and male populations. The axes are as explained in Figure 4. The subpopulations
are colored as: Back (blue), Left (red), Right (magenta), Front (green). For males,
a statistically significant age effect is observed for subpopulations Back, Left and
Right. No such effect is observed for females.

In Figure 5 the p-value versus subspace graphs are given for the male and female subpopulations.
As before, the front subpopulation does not show any statistical significance at any subspace level.
For the other subpopulations, a clear difference between male and female groups emerges.

For the female group, the first kind of structural effect of age (overall branchyness) cannot be
observed for any subpopulation. For the location-specific relationship (branchyness of the main
arteries) the lowest p-value that could be achieved comes from the right subpopulation at 0.5015,
slightly higher than the rule-of-thumb significance level of 0.05.

For the male group, the age versus overall branchyness relationship can be observed for the left, right and back
subpopulations at very significant levels (p-values below 0.01). The location-specific relationship
can again be observed for these three subpopulations at significant levels.

The study on the full data set implies that two kinds of age-structure relationships can be
observed in the whole population using this method. Subsequent analysis of the male and female
groups shows that the same effects are observed, more strongly, in the male group. Meanwhile,
no statistically significant age effect could be observed in the female group using these methods.
These results suggest that the brain vessel anatomy of males and females may respond differently
to aging: the overall branchyness and the branchyness of the longest arteries are reduced with age in
males, while these effects are not apparent for the female group. Therefore the effects observed in
the whole population may in fact be driven by the male subgroup.

5.2. Company Organization Data Set. 

5.2.1. Data Description. In this analysis, we use a company organization data set of a large US 
company. This data set is a snapshot of the employee list taken sometime during the last ten years. 
It also includes the information on hierarchical structure and the organizations that employees 
belong to. The set includes more than two hundred thousand employees active at the time when 
the snapshot was taken. In this section we will explain the general aspects of the data set that are 
relevant to our analysis, but we will hold back any specifics due to privacy reasons. 

The original company structure can be considered as one giant tree. Each employee is represented
as a node. The CEO of the company is the root node. The child-parent relationships are established
through the reporting structure: the children of a node are the employees that directly report to
that person in the company. Since every employee directly reports to exactly one person (except the
CEO, the root node), this system naturally lends itself to a tree representation. A very important
structural property of organization trees is that each higher-level employee usually has many
employees reporting to him or her. Therefore this organization tree is not binary, but a general
rooted tree. It has a maximum depth of 13 levels.
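
A minimal sketch of how such a tree could be assembled from reporting data is given below. The input format (a dictionary `reports_to` mapping each employee to a direct manager, with `None` for the CEO) and all names are assumptions made for illustration; the actual data format is not described in this paper.

```python
from collections import defaultdict

def build_org_tree(reports_to):
    """Turn an employee -> manager mapping into (root, children lists).

    reports_to is a hypothetical input: each key is an employee, each value
    is that employee's direct manager, and the root (CEO) maps to None.
    """
    children = defaultdict(list)
    root = None
    for employee, manager in reports_to.items():
        if manager is None:
            root = employee
        else:
            children[manager].append(employee)
    return root, dict(children)

def depth(node, children):
    """Number of levels below and including node (the root counts as level 1)."""
    return 1 + max((depth(c, children) for c in children.get(node, [])), default=0)

# Toy reporting structure: one node has three direct reports, so the tree is not binary.
reports_to = {"ceo": None, "a": "ceo", "b": "ceo", "c": "a", "d": "a", "e": "a", "f": "c"}
root, children = build_org_tree(reports_to)
levels = depth(root, children)
```

Because every employee has exactly one manager, each node receives exactly one parent, which is what makes the result a rooted tree.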

The company operations span various business activities, each main category being pursued by
a different business unit of the company. The heads of each of these business units report directly
to the CEO. Every person working in the company is assigned to one business unit, and these
units form the first level of organization codes. These business units are further divided into
sub-organizations, primarily with respect to their geographical locations around the world. A third
level of hierarchy again divides these units based on territory and job focus. The last organization
level, which we will be using to construct our data sets, is the fourth level of the hierarchy, and is
used to define departments that are dedicated to a particular type of job for a particular product
or service. For example, the Marketing department responsible for promoting a product group in
a given region of one of the business units is an organization at the fourth level of hierarchy. Just
as with the business unit, every person in the company is assigned an organization code at the second,
third and fourth levels. A person working in a particular department shares the first, second, third and
fourth levels of organization codes with her colleagues working in the same department.

In this study we will focus on populations of different departments across the company that are 
assigned to a similar type of job. When the whole organization tree is considered, the directors of 
these departments are at the fifth level of that tree. To form our data set, we gathered the list of all 
the directors in the company who are at the fifth level. Then, based on the organization codes, we 
determined the main job focus of the departments that the directors are leading. We selected four 
main groups of jobs to compare for our study: finance, marketing, research and development, and 
sales. The departments that focus on one of these four categories are assigned to those categories. 
Other departments that focus on different jobs, like legal affairs or IT support, are left out. For 
each category, each department assigned to that category forms one data point. The director of 
that department is taken as the root node of the data tree representing the department, and the 
people who work at that department are nodes of this tree. The structure of the tree is determined 
by the reporting structure within the department. 

The correspondence issue within the data sets requires some attention. A job-based
correspondence scheme between two data trees would involve determining which individuals in one
department perform a similar function to which individuals at the same reporting level in another
department, so that the nodes of those people can be considered "corresponding". With the
exception of the directors (who form the root nodes and naturally correspond to each other), this kind
of matching is virtually impossible for this data set, since job definitions within one department
depend greatly on the particulars of that department's job, and may not match jobs within
another department. Since this job-based correspondence is not possible, we employ the
descendant correspondence for the data points. Descendant correspondence was described earlier for the
binary tree setting. In the general tree setting, it works in a similar way: for the nodes that
are the children of the same parent node, the order from left to right is determined by the total
number of descendants of each of them. That is, the node with the largest number of descendants is
assigned as the left-most child, and so on.
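
The ordering rule can be sketched as follows. This is an illustrative sketch rather than the code used for the study; the tree is assumed to be stored as a dictionary from each node to the list of its children, and ties between siblings with equal descendant counts are left in their input order, since the text does not specify a tie-breaking rule.

```python
def count_descendants(node, children):
    """Total number of nodes strictly below node."""
    return sum(1 + count_descendants(c, children) for c in children.get(node, []))

def descendant_order(root, children):
    """Return child lists sorted left to right by decreasing descendant count."""
    ordered = {}

    def visit(node):
        kids = sorted(
            children.get(node, []),
            key=lambda c: count_descendants(c, children),
            reverse=True,  # most descendants first, i.e. left-most
        )
        if kids:
            ordered[node] = kids
        for k in kids:
            visit(k)

    visit(root)
    return ordered

# Toy tree: y has 3 descendants, z has 1, x has none.
children = {"r": ["x", "y", "z"], "y": ["y1", "y2"], "z": ["z1"], "y2": ["y3"]}
ordered = descendant_order("r", children)
```

Here the children of the root are reordered as y, z, x. Recomputing the counts at every node is quadratic in the worst case, which is acceptable for a sketch but would be cached in practice.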

The data set of finance departments constructed in this fashion consists of 37 data trees, with
a maximum depth of 6 levels. The marketing set has 60 trees with a maximum depth of 5, the sales set has 41
trees with a maximum depth of 5, and the research set has 20 trees with a maximum depth of 6. The support trees
of these sets can be seen in Figure 6.

Visualizing the organization trees requires a somewhat different approach than for binary trees. The
depth of these trees is not very large: 6 levels for the deepest data point. However, the node
population at each level is very dense. Therefore a radial drawing approach is used to display them.
(See Di Battista et al. (1999) for details on this method and many others for graph visualization.)
In a radial drawing of a rooted tree, the root node is at the origin. The root is surrounded by concentric
circles centered at the origin. We plot the nodes on these circles, with each circle reserved for the
nodes in one level of the tree. The position of each node on a circle is determined by its number
of descendants. For example, for the nodes on the second level, the 360 degrees available on
the circle are distributed among the nodes in proportion to the number of descendants they have. Nodes
with more descendants get more space. Each node is placed at the middle of the arc
allotted to it. The children of that node on the next circle share
this arc according to their own numbers of descendants. This scheme allocates
most of the space on the graph to the largest subtrees and distributes the nodes over the graph
as evenly as possible.
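
The layout rule can be sketched as below. Two choices here are our own assumptions and are not taken from the text: circle k is given radius k, and each node's share of an arc is proportional to its subtree size (its descendants plus the node itself) so that leaves, which have no descendants, still receive a nonzero arc.

```python
import math

def subtree_size(node, children):
    """Number of nodes in the subtree rooted at node, including node itself."""
    return 1 + sum(subtree_size(c, children) for c in children.get(node, []))

def radial_layout(root, children):
    """Map every node to (x, y); nodes of level k lie on the circle of radius k."""
    pos = {root: (0.0, 0.0)}

    def place(node, level, start, end):
        kids = children.get(node, [])
        total = sum(subtree_size(k, children) for k in kids)
        angle = start
        for k in kids:
            # Assumed weighting: arc share proportional to subtree size.
            span = (end - start) * subtree_size(k, children) / total
            mid = angle + span / 2.0  # node sits in the middle of its arc
            pos[k] = (level * math.cos(mid), level * math.sin(mid))
            place(k, level + 1, angle, angle + span)  # children share this arc
            angle += span

    place(root, 1, 0.0, 2.0 * math.pi)
    return pos

children = {"r": ["a", "b"], "a": ["a1", "a2"]}
pos = radial_layout("r", children)
```

In the toy tree, node a owns three quarters of the first circle because its subtree holds three of the four non-root nodes, and its two children split that arc evenly on the second circle.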

5.2.2. Analysis of Company Organization Data. The comparative structural analysis of these four
organization data sets is conducted via the principal component tree-lines. We have run the
dimension reduction method for general rooted trees as described in Section 3, although the forward
method of Section 2 would have given the same set of components, as shown in Section 4.

The principal components obtained with this analysis are shown in Figure 6. They are expressed
through the coloring scheme. A color scale starting from dark red, going through shades of yellow,
green, cyan and blue and ending at dark blue is used. The components that have a higher sum
of weights ($\sum w'(k)$) are colored with the shades on the red side, and those with a lower sum of weights get
the cooler shades. Since the backward principal components are ordered from low sum of weights
$\sum w'(k)$ to higher, this means the earlier BPCs (lower impact components) are shown in blue,
while the stronger components are in the yellow to red part of the scale. The color bar on the right of
each support tree shows which $\sum w'(k)$ corresponds to which shade for that support tree.

The first conclusions on the differences across types of departments come from the comparison of
their support tree structures. It can be clearly seen that the sales departments are larger than the others
in population. Another clear distinction is in the flatness of each organization type. Typically, a
flat organization does not have many levels of hierarchy, and most of the workers do not have
subordinates. This is common in organizations with a technical focus. In Figure 6 we can see that
the research departments are visibly flatter than the other three types: most of the nodes are at the
leaves and not at the interim levels. This is due to the fact that most of the employees in these
departments do engineering-research type of work, for which a strongly hierarchical organizational
model is less efficient. The other three data sets, finance, marketing and sales, have most of their
employees on interim levels, pointing to a strong hierarchy. This seems especially strong in the sales
departments.

In the next figure (Figure 7), the effect of gradually removing the principal components on the
number of nodes explained is shown. This figure is constructed in the same way as the right
panel of Figure 3. Figure 7 shows that none of the organization data sets have a very concave
variation-versus-components curve like the brain artery set did. Therefore, for the organizational structure setting,
the earlier BPCs have more potential to carry information compared to the artery setting. Between

[Figure 6 panels: Data Set Finance, All Components; Data Set Marketing, All Components; Data Set Research, All Components; Data Set Sales, All Components.]

Figure 6. Radial drawings of the support trees of four organization subsets: Finance,
Marketing, Research and Sales. The root nodes are at the center. The
principal components are represented through colors: earlier BPCs start from the
blue end of the color scale while the later BPCs go towards the red end. Nodes that
are in multiple components are colored with respect to the highest total weighted
component they are in. The color bar on the right of each panel shows the coloring
scheme according to the total weight of each BPC.



the organization data sets, we see that the curves belonging to research and sales are very close
to each other (the less concave pair), while the curves of finance and marketing are close in
shape (the more concave pair). The concavity of these curves depends on what percentage of the
variation is explained by the early BPCs, and what percentage by the later, stronger components.
A very concave curve means that most of the nodes of the data set can in fact be expressed through
a small number of principal components. This means that the structures within the data points
are not very diverse: the data trees of the set structurally look like each other, allowing a smaller
number of PCs to explain more of the nodes. Conversely, a less concave curve points to a data
set where a small portion of the principal components is not enough to explain many nodes due
to the diversity in the structures of the data points. Figure 7 shows that finance and marketing
departments are more uniformly structured than research and sales departments. That is, two random

Figure 7. The X axis is the number of backward principal components subtracted
from the subspace. The Y axis is the number of nodes that can be explained by
the remaining subspace at each X level. Both axes are scaled within themselves
so that the highest X and Y coordinates for all of the organization curves are 100.
The blue curve is for research, green is for marketing, black is for sales and red is
for finance.

finance data trees are more likely to have a shorter distance to each other than two random research 
data trees. 

A variation-versus-components curve is helpful in establishing the trend in the distribution of
variation within the data set: the earlier BPCs express nodes that are not common across the
data points, and the later BPCs cover the nodes that are common to most data points. The
next, more in-depth question is how these more common and less common nodes are
distributed among the data points themselves. To answer this question, we divide the set of all
BPCs into two subsets. The first 90% of the BPCs on the X axis of Figure 7 form one set
(SET 2). These BPCs collectively represent the subspace containing the less common nodes.
The remaining 10% of the BPCs form the other set (SET 1). These BPCs express the subspace
containing the more common structures. For any data tree $t$, its projection onto SET 1,
$P_{SET1}(t)$, represents the portion of the tree that is more common with other data trees in the
data set. The projection of $t$ onto SET 2, $P_{SET2}(t)$, carries the nodes that are less common
with others. Since these two sets are complementary, the two projections of $t$ give $t$ itself
when combined: $P_{SET1}(t) \cup P_{SET2}(t) = t$.
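
The split and the two projections can be sketched as follows. Several things here are assumptions made for illustration: trees and components are represented as sets of node labels, each component is given by its full root-to-leaf path, the starting tree is the root alone, and the projection of a tree onto a set of components is taken to be the starting tree together with the part of the tree lying on those paths.

```python
def project(tree, paths, start):
    """Projection of a tree (node set) onto the subspace spanned by paths."""
    covered = set().union(*paths) if paths else set()
    return set(start) | (set(tree) & covered)

def split_components(bpcs, fraction=0.9):
    """Split BPCs (ordered from earliest to latest) into (SET 1, SET 2).

    SET 2 holds the first `fraction` of the components, SET 1 the rest.
    """
    cut = int(fraction * len(bpcs) + 0.5)
    return bpcs[cut:], bpcs[:cut]

# Toy example with three components, split 2/3 versus 1/3 instead of 90/10.
bpcs = [{"r", "a", "a2"}, {"r", "b"}, {"r", "a", "a1"}]
set1, set2 = split_components(bpcs, fraction=2 / 3)
t = {"r", "a", "a1", "b"}
p1 = project(t, set1, {"r"})
p2 = project(t, set2, {"r"})
```

In this toy case the two projections overlap in the shared upper nodes, and their union recovers the whole data tree, as stated above.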

Figure 8 shows how the nodes in SET 1 and SET 2 are distributed among the data trees for each of
the organization data sets. For each data point, the length of its projection onto SET 2 is on the
Y axis, and the length of its projection onto SET 1 is on the X axis. Each axis
is scaled such that the highest coordinate for each data set is 1. Blue stars
denote the research data points, green squares are marketing data points, black crosses are sales
data points and red circles are finance data points. In Figure 8 it can be seen that none of the

Figure 8. The data points of each of the data sets: research (blue stars), marketing
(green squares), sales (black crosses) and finance (red circles). For each of the data
points, the length of its projection onto SET 2 is on the Y axis, and the length of
its projection onto SET 1 is given on the X axis. Each of these axes is scaled such
that the highest coordinate for each data set is 1 on each of the axes.



data points are above the 45 degree line. This is an artifact of the descendant correspondence. 

A very interesting aspect of Figure 8 is that the data points of each data set visually separate
from each other. This is especially true for the marketing departments, which follow a distinctly
more convex pattern compared to the other kinds of departments.

For finance departments, we observe an almost linear trend, starting from around X = 0.3. The 
bottom left data points are trees that are small in general: they contain little of the common nodes 
set and almost none of the non-common set. As we go top-right, the trees grow in SET 1 and 
SET 2 spaces proportionally. A similar pattern is there for sales departments, with the exception 
of a group of data points lying on the X axis, pointing to a group of very small departments that 
only consist of the main structure nodes. The research departments follow a lower angle pattern. 
However, this might be due to the one outlier department at the coordinate (1, 1), pushing all 
others to the left/bottom of the graph. 

The most significant pattern on this graph belongs to the marketing group. Unlike the other
departments, there is no linear alignment trend. The set seemingly consists of two kinds of departments.
The first is the group with very little projection on SET 2, and varying sizes of projection on SET 1.
These are relatively small departments. The second is a group of departments that contain all the
nodes represented by SET 1 (therefore the 'common structure' part of the trees is common to
all of these trees), and varying, but large, numbers of nodes represented in SET 2. These trees are
much larger than the trees of the first group. These two different modes of structure within this
group may be due to the particular kind of marketing activity, product family, etc. they focus on. The
details of the activities of each department are not part of our data set, therefore we are not able to offer
a reason for this separation. Note that two data sets that are shown to be structurally similar in
Figure 7, finance and marketing, are the furthest apart sets in Figure 8. This is because Figure
7 focuses on the overall dispersion of variation, while Figure 8 focuses on the relative differences
between the individual data trees.



6. Appendix 



Proof of Lemma 2.5.

Since $l_i = l_{i-1} \cup v_i$, we have that

\[
d(t, l_i) =
\begin{cases}
d(t, l_{i-1}) - 1 & \text{if } v_i \in t, \\
d(t, l_{i-1}) + 1 & \text{otherwise.}
\end{cases}
\]

In other words, the distance of the tree to the line decreases as we keep adding nodes of $p_L$ that
are in $t$, and when we step out of $t$, the distance begins to increase. □
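
The behaviour described in this lemma can be checked numerically on a toy example. The sketch below assumes trees are stored as sets of node labels and that the distance between two trees is the size of their symmetric difference; the node labels are made up for illustration.

```python
def distance(t1, t2):
    """Tree distance: number of nodes that belong to exactly one of the trees."""
    return len(set(t1) ^ set(t2))

def tree_line(start, path_nodes):
    """Nested trees l_0, l_1, ...; each member adds the next node of the path."""
    line = [frozenset(start)]
    for v in path_nodes:
        line.append(line[-1] | {v})
    return line

t = {"r", "a", "a1"}
line = tree_line({"r"}, ["a", "a1", "a11"])  # a11 is not a node of t
dists = [distance(t, l) for l in line]
steps = [b - a for a, b in zip(dists, dists[1:])]
```

The distance drops by one for each added node that lies in the tree and rises by one as soon as the line steps outside it.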

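Proof of Lemma 2.8.

For simplicity, we only prove the statement for $q = 2$. Assume that

\[
L_1 = \{l_{1,0}, l_{1,1}, \dots, l_{1,k_1}\}, \qquad L_2 = \{l_{2,0}, l_{2,1}, \dots, l_{2,k_2}\}
\]

with $l_0 = l_{1,0} = l_{2,0}$, and

\[
l_{1,i} = l_{1,i-1} \cup v_{1,i} \ \text{ for } 1 \le i \le k_1, \qquad
l_{2,j} = l_{2,j-1} \cup v_{2,j} \ \text{ for } 1 \le j \le k_2.
\]

Also assume

\[
P_{L_1}(t) = l_{1,r_1} \tag{6.1}
\]

and

\[
P_{L_2}(t) = l_{2,r_2}. \tag{6.2}
\]

Let $f(i,j)$ be the distance between the trees $t$ and $l_{1,i} \cup l_{2,j}$, for $1 \le i \le k_1$ and $1 \le j \le k_2$.
Using Lemma 2.5, equations (6.1) and (6.2) mean

\[
v_{1,i} \in t \ \text{ if } i \le r_1, \quad \text{and} \quad v_{2,j} \in t \ \text{ if } j \le r_2.
\]

Hence,

\[
\begin{aligned}
f(i,j) &\le f(i-1,j), && \text{if } i \le r_1, \\
f(i,j) &\ge f(i-1,j), && \text{if } i > r_1.
\end{aligned} \tag{6.3}
\]

By symmetry, we have

\[
\begin{aligned}
f(i,j) &\le f(i,j-1), && \text{if } j \le r_2, \\
f(i,j) &\ge f(i,j-1), && \text{if } j > r_2.
\end{aligned} \tag{6.4}
\]

Overall, equations (6.3) and (6.4) imply that the function $f$ attains its minimum at $i = r_1$, $j = r_2$,
which is what we had to prove. □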
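
The statement for q = 2 can be checked by brute force on a small example. The sketch assumes trees are sets of node labels, the distance is the size of the symmetric difference, and a projection is the member of the line (or of the union of two lines) closest to the tree; all labels are made up for illustration.

```python
from itertools import product

def dist(a, b):
    """Distance between two trees given as node sets."""
    return len(a ^ b)

def tree_line(start, path_nodes):
    """Nested trees l_0, l_1, ...; each member adds the next node of the path."""
    line = [frozenset(start)]
    for v in path_nodes:
        line.append(line[-1] | {v})
    return line

def project_on_line(t, line):
    """Member of a single tree-line that is closest to t."""
    return min(line, key=lambda l: dist(t, l))

def project_on_union(t, line1, line2):
    """Closest tree to t among all unions of one member from each line."""
    return min((a | b for a, b in product(line1, line2)), key=lambda l: dist(t, l))

t = frozenset({"r", "a", "a1", "b"})
L1 = tree_line({"r"}, ["a", "a1", "a11"])
L2 = tree_line({"r"}, ["a", "a2"])
lhs = project_on_union(t, L1, L2)
rhs = project_on_line(t, L1) | project_on_line(t, L2)
```

On this input the projection onto the union of the two lines coincides with the union of the two separate projections.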

Proof of Theorem 2.12.

The definition of the $k^{th}$ PC tree-line in terms of paths is equivalent to the equation

\[
\begin{aligned}
p_k^f
&= \arg\min_{p_L \in \mathcal{P}} \sum_{t} \Big( \big| t \setminus (l_0 \cup p_1^f \cup \dots \cup p_{k-1}^f) \big| - \big| (t \cap p_L) \setminus (l_0 \cup p_1^f \cup \dots \cup p_{k-1}^f) \big| \Big) \\
&= \arg\max_{p_L \in \mathcal{P}} \sum_{v \in p_L} w_k(v).
\end{aligned}
\]

The last equation corresponds to the path with maximum sum of $w_k$ weights in the support tree. □
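
A greedy procedure in the spirit of this result can be sketched as follows. This is an illustrative sketch under stated assumptions, not the algorithm as specified in Section 2: trees are sets of node labels, paths are tuples of labels from the root to a leaf of the support tree, the weight of a node is taken to be the number of data trees containing it (and zero once the node is covered by the starting tree or an earlier path), and ties are broken by input order.

```python
def forward_paths(trees, paths, start):
    """Order paths by repeatedly taking the one with the largest weight sum."""
    count = {}
    for t in trees:
        for v in t:
            count[v] = count.get(v, 0) + 1

    covered = set(start)   # nodes whose weight is already zero
    remaining = list(paths)
    order = []
    while remaining:
        best = max(
            remaining,
            key=lambda p: sum(count.get(v, 0) for v in p if v not in covered),
        )
        order.append(best)
        covered |= set(best)
        remaining.remove(best)
    return order

trees = [{"r", "a", "a1"}, {"r", "a", "a2"}, {"r", "a", "a1", "b"}, {"r", "b"}]
paths = [("r", "a", "a1"), ("r", "a", "a2"), ("r", "b")]
order = forward_paths(trees, paths, {"r"})
```

In the toy data the path through a1 is selected first (weight 5), then the path to b (weight 2), and finally the path through a2, whose only uncovered node appears in a single tree.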
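Proof of Theorem 3.5.

The definition of the $k^{th}$ BPC tree-line (see Equation 3.1) in terms of paths is equivalent to the
equation

\[
\begin{aligned}
p_k^b
&= \arg\min_{p_L \in B} \sum_{t} d\Big(t,\ l_0 \cup \big((\cup_{p \in B \setminus \{p_L\}}\, p) \cap t\big)\Big), \qquad \text{where } B = \mathcal{P} \setminus \{p_n^b, \dots, p_{k+1}^b\} \\
&= \arg\min_{p_L \in B} \sum_{t} \Big| t \setminus \Big(l_0 \cup \big((\cup_{p \in B \setminus \{p_L\}}\, p) \cap t\big)\Big) \Big| + \Big| \Big(l_0 \cup \big((\cup_{p \in B \setminus \{p_L\}}\, p) \cap t\big)\Big) \setminus t \Big| \\
&= \arg\min_{p_L \in B} \sum_{t} \Big| t \setminus \Big(l_0 \cup \big((\cup_{p \in B \setminus \{p_L\}}\, p) \cap t\big)\Big) \Big| + \Big| (l_0 \setminus t) \cup \Big(\big((\cup_{p \in B \setminus \{p_L\}}\, p) \cap t\big) \setminus t\Big) \Big| \\
&= \arg\min_{p_L \in B} \sum_{t} \Big| t \setminus \Big(l_0 \cup \big((\cup_{p \in B \setminus \{p_L\}}\, p) \cap t\big)\Big) \Big| + | l_0 \setminus t | \\
&= \arg\min_{p_L \in B} \sum_{t} \Big| t \setminus \Big(l_0 \cup \big((\cup_{p \in B \setminus \{p_L\}}\, p) \cap t\big)\Big) \Big| \\
&= \arg\min_{p_L \in B} \sum_{t} \Big| t \setminus \big(l_0 \cup (\cup_{p \in B \setminus \{p_L\}}\, p)\big) \Big| \\
&= \arg\min_{p_L \in B} \sum_{t} \Big| (t \cap p_L) \setminus \big(l_0 \cup (\cup_{p \in B \setminus \{p_L\}}\, p)\big) \Big| + \sum_{t} \Big| \big(t \cap (\cup_{p \in \mathcal{P} \setminus B}\, p)\big) \setminus \big(l_0 \cup (\cup_{p \in B}\, p)\big) \Big| \\
&= \arg\min_{p_L \in B} \sum_{t} \Big| (t \cap p_L) \setminus \big(l_0 \cup (\cup_{p \in B \setminus \{p_L\}}\, p)\big) \Big| \\
&= \arg\min_{p_L \in B} \sum_{t} \ \sum_{v \in (t \cap p_L) \setminus (l_0 \cup (\cup_{p \in B \setminus \{p_L\}}\, p))} 1 \\
&= \arg\min_{p_L \in B} \sum_{v \in p_L} w'_k(v).
\end{aligned}
\]

From the last equation the result follows. □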
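
The corresponding backward procedure can be sketched as below. This is an illustrative sketch under stated assumptions, not the algorithm as specified in Section 3: trees are sets of node labels, paths are tuples of labels from the root to a leaf of the support tree, the backward weight of a node on a candidate path is taken to be the number of data trees containing it when the node lies on no other remaining path and not in the starting tree (and zero otherwise), and ties are broken by input order.

```python
def backward_paths(trees, paths, start):
    """Remove paths one at a time, always the one with the smallest weight sum."""
    count = {}
    for t in trees:
        for v in t:
            count[v] = count.get(v, 0) + 1

    remaining = list(paths)
    removed = []
    while remaining:
        def loss(p):
            # Nodes shared with the starting tree or another remaining path
            # stay explained after p is removed, so they carry zero weight.
            others = set(start).union(*(q for q in remaining if q is not p))
            return sum(count.get(v, 0) for v in p if v not in others)

        weakest = min(remaining, key=loss)
        removed.append(weakest)
        remaining.remove(weakest)
    return removed

trees = [{"r", "a", "a1"}, {"r", "a", "a2"}, {"r", "a", "a1", "b"}, {"r", "b"}]
paths = [("r", "a", "a1"), ("r", "a", "a2"), ("r", "b")]
removed = backward_paths(trees, paths, {"r"})
```

On the toy data the path through a2 is removed first (it costs one node), then the path to b, and the path through a1 is the last one left, so the removal order runs from the weakest component to the strongest.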



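Proof of Proposition 4.1.

Suppose there exist $i$ and $j$ with $1 \le i \le k < j \le n$ and $p_i^f = p_j^b$. Without loss of generality,
suppose that $j$ is the largest index where the assumption holds. Let $p_L$ denote the path $p_i^f = p_j^b$
and let $B = \{p_n^b, \dots, p_{j+1}^b\}$. Since $1 \le i \le k < j \le n$, the set of paths $\mathcal{P} \setminus \{B\}$ contains at least
two paths. Let $v \in p_L$ be the first node from the leaf to the root that has at least two children in
$Supp(\mathcal{P} \setminus \{B\})$. There are two possibilities:

I. $v \notin l_0$, i.e. there is at least one path different from $p_L$ in $\mathcal{P} \setminus \{B\}$ that has $v$ as a node, or
II. $v \in l_0$.

In both cases, $w'_j(u) = 0$ for all $u$ in the path $p_v$ from $v$ to the root.

Consider case I. Let $p_{L'} \in \mathcal{P} \setminus \{B\}$ be a path different from $p_L$ that contains $v$ in it. Let $p_v$ be
the path from the root to $v$. Since $p_L = p_j^b$,

\[
\sum_{u \in p_L \setminus p_v} w'_j(u) = \sum_{u \in p_L} w'_j(u) \le \sum_{u \in p_{L'}} w'_j(u) = \sum_{u \in p_{L'} \setminus p_v} w'_j(u). \tag{6.5}
\]

On the other hand, since $p_L = p_i^f$,

\[
\sum_{u \in p_L} w_i(u) \ge \sum_{u \in p_{L'}} w_i(u). \tag{6.6}
\]

Next, we need to show that the following holds:

\[
\sum_{u \in p_{L'} \setminus p_v} w'_j(u) \le \sum_{u \in p_{L'} \setminus p_v} w_i(u). \tag{6.7}
\]

To do this, suppose that $\sum_{u \in p_{L'} \setminus p_v} w'_j(u) > \sum_{u \in p_{L'} \setminus p_v} w_i(u)$. It implies that there is at least
one node $v'$ that has $w'_j(v') > 0$ and $w_i(v') = 0$. Since $w_i(v') = 0$, a path that contains $v'$ and is
different from $p_{L'}$ was yielded by the forward algorithm before $p_{L'}$. However, this implies that there
are at least two paths that have $v'$ as a node at step $j$ in the backward algorithm, and then $w'_j(v') = 0$.
This gives a contradiction.

It is straightforward to see

\[
\sum_{u \in p_L \setminus p_v} w_i(u) \le \sum_{u \in p_L \setminus p_v} w'_j(u). \tag{6.8}
\]

Let us suppose that the inequality in (6.5) is strict, i.e.

\[
\sum_{u \in p_L \setminus p_v} w'_j(u) < \sum_{u \in p_{L'} \setminus p_v} w'_j(u). \tag{6.9}
\]

We have

\[
\begin{aligned}
\sum_{u \in p_L} w_i(u)
&= \sum_{u \in p_v} w_i(u) + \sum_{u \in p_L \setminus p_v} w_i(u) \\
&\le \sum_{u \in p_v} w_i(u) + \sum_{u \in p_L \setminus p_v} w'_j(u) \\
&< \sum_{u \in p_v} w_i(u) + \sum_{u \in p_{L'} \setminus p_v} w'_j(u) \\
&\le \sum_{u \in p_v} w_i(u) + \sum_{u \in p_{L'} \setminus p_v} w_i(u) = \sum_{u \in p_{L'}} w_i(u),
\end{aligned}
\]

which is a contradiction to equation (6.6). Therefore, equation (6.5) has to be an equality, i.e.

\[
\sum_{u \in p_L \setminus p_v} w'_j(u) = \sum_{u \in p_{L'} \setminus p_v} w'_j(u). \tag{6.10}
\]

If one or both of the inequalities

\[
\sum_{u \in p_{L'} \setminus p_v} w'_j(u) < \sum_{u \in p_{L'} \setminus p_v} w_i(u)
\quad \text{and} \quad
\sum_{u \in p_L \setminus p_v} w_i(u) < \sum_{u \in p_L \setminus p_v} w'_j(u)
\]

hold, then the result follows in the same way as above. Finally, let us suppose

\[
\sum_{u \in p_{L'} \setminus p_v} w'_j(u) = \sum_{u \in p_{L'} \setminus p_v} w_i(u)
\quad \text{and} \quad
\sum_{u \in p_L \setminus p_v} w_i(u) = \sum_{u \in p_L \setminus p_v} w'_j(u),
\]

which implies that

\[
\sum_{u \in p_{L'}} w'_j(u) = \sum_{u \in p_L} w'_j(u)
\quad \text{and} \quad
\sum_{u \in p_{L'}} w_i(u) = \sum_{u \in p_L} w_i(u).
\]

Now, since $p_i^f = p_L$, we have $p_L > p_{L'}$. And, since $p_j^b = p_L$, we have $p_L < p_{L'}$, which is a
contradiction.

In the case II, where $v \in l_0$, let $v'$ be the last node from the root to the leaf in $p_L$ that belongs
to $l_0$. Take $p_{L'} \in \mathcal{P} \setminus \{B\}$ as a path different from $p_L$, and $v''$ as the last node from the root to the
leaf in $p_{L'}$ that belongs to $l_0$. Let $p_{v'}$ be the unique path from the root to the node $v'$ and $p_{v''}$ the
unique path from the root to the node $v''$. Since $p_{v'}$ and $p_{v''}$ are contained in $l_0$, we have

\[
\sum_{u \in p_{v'}} w_i(u) = \sum_{u \in p_{v''}} w_i(u) = \sum_{u \in p_{v'}} w'_j(u) = \sum_{u \in p_{v''}} w'_j(u) = 0.
\]

Since $p_L = p_j^b$,

\[
\sum_{u \in p_L} w'_j(u) \le \sum_{u \in p_{L'}} w'_j(u). \tag{6.11}
\]

On the other hand, since $p_L = p_i^f$,

\[
\sum_{u \in p_L} w_i(u) \ge \sum_{u \in p_{L'}} w_i(u). \tag{6.12}
\]

Similar to case I, we can see that (6.11) is an equality. This gives a contradiction.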
□

Proof of Theorem 4.2.

By Proposition 4.1, we have that at step $n-1$ of the forward algorithm there is no tree-line
yielded by the forward algorithm equal to $L_n^b$; then $L_n^f = L_n^b$. At step $n-2$, there is no tree-line
yielded by the forward algorithm equal to $L_n^b$ or $L_{n-1}^b$. Since $L_n^f = L_n^b$, we have $L_{n-1}^b = L_{n-1}^f$.
We continue iteratively until step 1. At the end, we will have $L_k^f = L_k^b$ for all $1 \le k \le n$. □